[WIP] plugin: add new jobtap plugin to track job usage across associations' jobs #770
Open
cmoussa1 wants to merge 3 commits into flux-framework:master
Problem: As mentioned in flux-framework#650, there is a need to enforce a limit on an association's ability to run jobs on a certain instance type, which flux-accounting already tracks in its SQLite database by calculating job usage (the product of the number of nodes allocated to a job and its duration). However, there is currently no way to enforce this using flux-accounting's mf_priority jobtap plugin. Begin to lay the groundwork for this limit enforcement by adding a new jobtap plugin called compute_hours_limits, which, for now, only tracks an association's current usage across all of their running jobs and adds the actual usage of each job to the association's total_usage value when the job completes.
Problem: There is no way to send flux-accounting database information to the compute_hours_limits plugin. Add a command that extracts flux-accounting database information and packs it into JSON objects to be sent over to and unpacked by the compute_hours_limits plugin.
Problem: There are no tests for the compute_hours_limits plugin. Add some basic tests.
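The second commit above describes extracting flux-accounting database information and packing it into JSON for the plugin. A minimal sketch of that idea follows. The table name `association_table`, the columns `username`, `bank`, and `job_usage`, and the payload keys `data` and `total_usage` are assumptions made for illustration; the actual schema and payload layout used by the PR are not shown on this page.

```python
import json
import sqlite3


def pack_associations(conn: sqlite3.Connection) -> str:
    """Read per-association usage rows and pack them into a JSON string.

    Table, column, and key names here are hypothetical placeholders.
    """
    rows = conn.execute(
        "SELECT username, bank, job_usage "
        "FROM association_table ORDER BY username, bank"
    ).fetchall()
    payload = {
        "data": [
            {"username": user, "bank": bank, "total_usage": float(usage)}
            for user, bank, usage in rows
        ]
    }
    return json.dumps(payload)


if __name__ == "__main__":
    # Build a throwaway in-memory database so the sketch runs on its own.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE association_table "
        "(username TEXT, bank TEXT, job_usage REAL)"
    )
    conn.execute(
        "INSERT INTO association_table VALUES (?, ?, ?)",
        ("user1", "bankA", 7200.0),
    )
    print(pack_associations(conn))
```

In the real command, the resulting JSON would presumably be sent to the plugin over an RPC rather than printed.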
Force-pushed from e3a7f50 to 3b4ed80.
Codecov Report

Additional details and impacted files

| | master | #770 | +/- |
|---|---|---|---|
| Coverage | 83.56% | 83.02% | -0.54% |
| Files | 27 | 28 | +1 |
| Lines | 2421 | 2615 | +194 |
| Hits | 2023 | 2171 | +148 |
| Misses | 398 | 444 | +46 |
Problem
As mentioned by @vsoch in #650, there is a need to enforce a limit on an association's ability to run jobs on a certain instance type. flux-accounting already tracks this in its SQLite database by calculating job usage (the product of the number of nodes allocated to a job and its duration). However, there is currently no way to enforce such a limit using flux-accounting's `mf_priority` jobtap plugin.
This PR begins to lay the groundwork for this "usage" limit enforcement by adding a new jobtap plugin (called `compute_hours_limits`) which, for now, just tracks an association's current usage across all of their running jobs (estimating each job's anticipated usage by multiplying the job's `nnodes` by its requested duration) and computes the job's actual usage when the job completes. The workflow looks something like this:

The user submits a job and specifies its size and duration (or a default duration is set on the job):

    $ flux submit -N4 -S duration=3600 my_job

The plugin takes these resource specifications and calculates an expected usage by multiplying the two numbers together.

When the job transitions to `RUN`, the `expected_usage` is added to the association's `current_usage` attribute.

When the job completes, the job's actual usage is calculated and added to the association's `total_usage` attribute, and the expected usage from the job is subtracted from the association's `current_usage` attribute.

The associations' `total_usage` attributes can be reset to `0.0` by sending a `"clear"` RPC to the plugin.

To keep the plugin code manageable to review, I've only added tracking of an association's job usage in this PR. If this looks like the right way to track an association's usage (and eventually enforce a limit), I can submit follow-up PRs to add that functionality.
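The tracking workflow described above can be modeled in a few lines of Python. This is only a sketch of the bookkeeping, not the plugin's code: the `Association` and `UsageTracker` names, the `(username, bank)` key, and the choice of node-seconds as the unit are assumptions made for illustration, and actual usage is assumed to be `nnodes` multiplied by the elapsed run time.

```python
from dataclasses import dataclass


@dataclass
class Association:
    """Hypothetical stand-in for the plugin's per-association record."""
    current_usage: float = 0.0  # expected usage of jobs that are running
    total_usage: float = 0.0    # actual usage of jobs that have completed


class UsageTracker:
    """Illustrative model of the usage bookkeeping (assumed names)."""

    def __init__(self):
        self.associations = {}  # (username, bank) -> Association
        self.running = {}       # job id -> (association key, expected usage)

    def job_run(self, jobid, key, nnodes, duration):
        # Expected usage is the node count times the requested duration.
        expected = nnodes * duration
        assoc = self.associations.setdefault(key, Association())
        assoc.current_usage += expected
        self.running[jobid] = (key, expected)

    def job_complete(self, jobid, nnodes, elapsed):
        # Add the actual usage to total_usage and release the expected
        # usage that was reserved when the job started running.
        key, expected = self.running.pop(jobid)
        assoc = self.associations[key]
        assoc.total_usage += nnodes * elapsed
        assoc.current_usage -= expected

    def clear(self):
        # Mirrors the "clear" RPC: reset every association's total_usage.
        for assoc in self.associations.values():
            assoc.total_usage = 0.0


if __name__ == "__main__":
    tracker = UsageTracker()
    key = ("user1", "bankA")
    # Corresponds to: flux submit -N4 -S duration=3600 my_job
    tracker.job_run(1, key, nnodes=4, duration=3600)
    print(tracker.associations[key].current_usage)  # 14400.0
    # Suppose the job finishes after 1800 seconds.
    tracker.job_complete(1, nnodes=4, elapsed=1800)
    print(tracker.associations[key].total_usage)    # 7200.0
    print(tracker.associations[key].current_usage)  # 0.0
```

Keeping the expected usage per job id means the amount subtracted at completion always matches the amount added at `RUN`, even if the job ends earlier than its requested duration.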
I've added some basic tests to showcase how the plugin tracks job usage for a given association by submitting one or multiple jobs and ensuring the `current_usage` and `total_usage` values are calculated correctly.

still some things to take care of

- `Job` object after it has transitioned to `job.state.inactive`
- `total_usage` value so it doesn't just increase indefinitely